---
id: multimodal-metrics-tool-correctness
title: Multimodal Tool Correctness
sidebar_label: Multimodal Tool Correctness
---

<head>
  <link
    rel="canonical"
    href="https://deepeval.com/docs/multimodal-metrics-tool-correctness"
  />
</head>

import Equation from "@site/src/components/Equation";

The multimodal tool correctness metric is an agentic LLM metric that assesses your multimodal LLM agent's function/tool calling ability. It is calculated by comparing whether every tool that is expected to be used was indeed called.

:::info
The `MultimodalToolCorrectnessMetric` allows you to define the **strictness** of correctness. By default, it considers matching tool names to be correct, but you can also require input parameters and output to match.
:::

## Required Arguments

To use the `MultimodalToolCorrectnessMetric`, you'll have to provide the following arguments when creating an [`MLLMTestCase`](/docs/evaluation-test-cases#mllm-test-case):

- `input`
- `actual_output`
- `tools_called`
- `expected_tools`

The `input` and `actual_output` are required to create an `MLLMTestCase` (and hence required by all metrics) even though they might not be used for metric calculation. Read the [How Is It Calculated](#how-is-it-calculated) section below to learn more.

## Usage

```python
from deepeval.metrics import MultimodalToolCorrectnessMetric
from deepeval.test_case import MLLMTestCase, ToolCall

test_case = MLLMTestCase(
    input="What's in this image?",
    actual_output="The image shows a pair of running shoes.",
    # Replace this with the tools that was actually used by your LLM agent
    tools_called=[ToolCall(name="ImageAnalysis"), ToolCall(name="ToolQuery")],
    expected_tools=[ToolCall(name="ImageAnalysis")],
)

metric = MultimodalToolCorrectnessMetric()
metric.measure(test_case)
print(metric.score)
print(metric.reason)
```

There are **SEVEN** optional parameters when creating a `MultimodalToolCorrectnessMetric`:

- [Optional] `threshold`: a float representing the minimum passing threshold, defaulted to 0.5.
- [Optional] `evaluation_params`: A list of `ToolCallParams` indicating the strictness of the correctness criteria, available options are `ToolCallParams.INPUT_PARAMETERS` and `ToolCallParams.OUTPUT`. For example, supplying a list containing `ToolCallParams.INPUT_PARAMETERS` but excluding `ToolCallParams.OUTPUT`, will deem a tool correct if the tool name and input parameters match, even if the output does not. Defaults to an empty list.
- [Optional] `include_reason`: a boolean which when set to `True`, will include a reason for its evaluation score. Defaulted to `True`.
- [Optional] `strict_mode`: a boolean which when set to `True`, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to `False`.
- [Optional] `verbose_mode`: a boolean which when set to `True`, prints the intermediate steps used to calculate said metric to the console, as outlined in the [How Is It Calculated](#how-is-it-calculated) section. Defaulted to `False`.
- [Optional] `should_consider_ordering`: a boolean which when set to `True`, will consider the ordering in which the tools were called in. For example, if `expected_tools=[ToolCall(name="ImageAnalysis"), ToolCall(name="ToolQuery"), ToolCall(name="ImageAnalysis")]` and `tools_called=[ToolCall(name="ImageAnalysis"), ToolCall(name="ImageAnalysis"), ToolCall(name="ToolQuery")]`, the metric will consider the tool calling to be incorrect. Only available for `ToolCallParams.TOOL` and defaulted to `False`.
- [Optional] `should_exact_match`: a boolean which when set to `True`, will require the `tools_called` and `expected_tools` to be exactly the same. Available for `ToolCallParams.TOOL` and `ToolCallParams.INPUT_PARAMETERS` and defaulted to `False`.

:::info
Since `should_exact_match` is a stricter criteria than `should_consider_ordering`, setting `should_consider_ordering` will have no effect when `should_exact_match` is set to `True`.
:::

## How Is It Calculated?

:::note
The `MultimodalToolCorrectnessMetric`, unlike all other `deepeval` metrics, is not calculated using any models or LLMs, and instead via exact matching between the `expected_tools` and `tools_called` parameters.
:::

The **multimodal tool correctness metric** score is calculated according to the following equation:

<Equation
  formula="\text{Tool Correctness} = \frac{\text{Number of Correctly Used Tools (or Correct Input Parameters/Outputs)}}{\text{Total Number of Expected Tools}}
"
/>

This metric assesses the accuracy of your agent's tool usage by comparing the `tools_called` by your multimodal LLM agent to the list of `expected_tools`. A score of 1 indicates that every tool utilized by your LLM agent was called correctly according to the list of `expected_tools`, `should_consider_ordering`, and `should_exact_match`, while a score of 0 signifies that none of the `tools_called` were called correctly.

:::info
If `exact_match` is not specified and `ToolCall.INPUT_PARAMETERS` is included in `evaluation_params`, correctness may be a percentage score based on the proportion of correct input parameters (assuming the name and output are correct, if applicable).
:::
